Abstract Sequences of macromolecules have 
“signals” or patterns that arise from a number of 
sources, particularly from shared common history 
or phylogeny. We discuss methods for inferring 
evolutionary trees from these patterns or signals 
under five properties desired for an ideal method. 
These five desiderata are that the methods be 
efficient (fast), consistent, powerful, robust, and 
falsifiable. Our conclusion is that corrections for 
multiple changes in sequences are the most 
important factor for any method to be consistent. 
Most optimality criteria, including compatibility and 
parsimony, become consistent when the sequences 
have appropriate corrections for multiple changes. 
Conversely, virtually no methods are consistent 
without adjustments for multiple changes. Hadamard 
conjugations are used to illustrate relationships 
between different methods, and their use is then 
illustrated by combining them with the closest tree 
optimality criterion. The data used to illustrate 
these recent developments include DNA sequences 
used to study the origin of chloroplasts and also 
New Zealand skinks (Leiolopisma spp.). 


Keywords evolutionary trees; spectral analysis; 
parsimony; closest tree 


INTRODUCTION 


Reconstructing unobservable events that occurred 
hundreds of millions or more years ago is a 
fascinating scientific problem. Establishing the 
relationships for a set of taxa is an example of this 
problem. We do not observe directly these past 
events but, as suggested by Zuckerkandl & Pauling 
(1965), the genetic sequences of all organisms carry 
a record of their history—“living matter preserves 
inscribed in its organization its own past history”. 
From these sequences we infer the pattern of 
relationships of the species. It is not surprising, given 
the conceptual difficulty of reconstructing accurately 
these remote events, that there have been two 
common responses. These have been dogmatism, 
that a certain method is always the best, or despair 
of the real tree ever being known. Clearly, having 
recognised these two extremes, we want to explore 
the middle ground of understanding the methods that 
are available, finding their limits, and extending and 
improving them. 

In this paper we develop two themes: first, that 
sequences carry signals or patterns arising from a 
number of causes, and second, the five criteria that 
we would like any method of reconstructing trees 
to possess. When analysing the patterns in sequences 
with sites independent, it is convenient to group 
together sites with the same pattern or signal. For 
example, Fig. 1 shows the eight possible relative 
patterns for two-state characters (two colours) with 
four taxa and with symmetry between the two 
character states. For this analysis, the frequency of 
each of the eight patterns is counted and the results 
collected into a vector or represented as a histogram. 
The different signals or patterns may be observed 
directly in the data, or after processing to enhance 
signals of interest. 

taxon1 aaaaaaaa 
taxon2 abbaabba 
taxon3 ababbaba 
taxon4 abababab 

index  01234567 
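
The counting step can be sketched in a few lines. The function name `pattern_counts` and the indexing rule are assumptions made for illustration: each site is scored by which of taxa 1 to n-1 differ from the last taxon, with taxon k contributing 2^(k-1) to the index. Applied to the four rows shown for Fig. 1, this rule gives the eight columns the indices 0 to 7 in order, so each pattern is counted once.

```python
def pattern_counts(seqs):
    """Count two-state site patterns for aligned sequences.

    Assumed indexing (illustrative): taxon k (k = 1..n-1) adds 2**(k-1)
    to the index of a site when its state differs from that of taxon n.
    """
    n = len(seqs)
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    counts = [0] * (2 ** (n - 1))
    for site in zip(*seqs):
        reference = site[-1]  # state of the last taxon at this site
        index = sum(2 ** k for k, state in enumerate(site[:-1])
                    if state != reference)
        counts[index] += 1
    return counts


# The four rows shown for Fig. 1: one site for each of the eight patterns.
fig1_rows = ["aaaaaaaa", "abbaabba", "ababbaba", "abababab"]
s = pattern_counts(fig1_rows)
print(s)
```

With real alignments the resulting list is the vector s of observed pattern frequencies.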

Signals may arise for a large number of reasons, 
including: 

(1) a historical signal arising from shared common 
ancestry—this is the signal we wish to detect 
when reconstructing evolutionary trees; 

(2) convergence—this signal results from multiple 
changes, whether parallel, reversion, or 
convergent changes. Depending on the mechan- 
ism of change they may arise from unequal rates 
of change on different edges of the tree 
(Felsenstein 1978), the interspersion of long, 
short, and long edges of the tree (Hendy & 
Penny 1989), unequal GC contents in different 
sequences (Lockhart et al. 1992), or site specific 
effects (e.g., where some sites in the sequence 
are able to evolve and others not); 

(3) a non-tree model—most calculations assume the 
sequences evolved on a tree but if there has been 
recombination, hybridisation, or endosymbiosis, 
then a tree may be too simple a model and 
reticulate evolution may need to be considered; 

(4) computer error—which may arise from either 
program or hardware problems. All programs 
include some errors but in addition the algo- 
rithms used may be inadequate, they may be 
heuristic programs that do not consider all 
possibilities, the programs may be too slow to 
run in a reasonable time, or there may be 
inadequate corrections for multiple changes; 

(5) natural selection—when the same change has 
been selected on different lineages, for example, 
langur monkey stomach lysozymes (Irwin & 
Wilson 1990). In this example, sequences have 
converged under acidic conditions in a rumen 
but natural selection could equally lead to 
divergence of sequences, leading to an apparent 
“historical” signal; 

(6) the mechanism of change may be too simple— 
most methods assume that changes to the 
sequence are independent and identically 
distributed (i.i.d) but we know in practice this 
is too simple for almost all cases; and 

(7) errors or false signals—these may arise from 
sampling error either from a short sequence or 
an unfortunate choice of an individual from a 
population, from sequencing errors, or from 
misalignment of the sequences. 

Fig. 1 Patterns, or signals, in the sequences. The eight 
possible patterns for four sequences of two-state 
characters are indexed by a binary (powers of 2) system 
that makes calculation simpler (Hendy & Penny 1993). 
The number of positions in the sequence having each 
pattern is counted and expressed as a vector s (s for 
sequences), which is shown here as a histogram 
(horizontal axis: pattern index 0-7; vertical axis: 
frequencies of patterns). These are the signals in the 
sequences. An indexing system for four-state characters, 
such as nucleotides, is available from the author. 

When inferring evolutionary trees, the signal of 
interest is the historical signal, that is “the patterns 
in the data that indicate the correct historical tree”, 
that have not arisen from multiple substitutions. 
Other patterns will be “noise”. However, in other 
research problems the signal of interest may be 
different. For example, a study on mutation mechan- 
isms may search for patterns arising independently 
in organisms with a high GC content. Similarly, a 
study on the function of a macromolecule may search 
for patterns arising on different lines of descent 
(Irwin & Wilson 1990). What is “signal” in one 
application may be “noise” in another. 

The second theme of this paper is the set of desiderata 
for a method for inferring evolutionary trees. We 
desire a method to be: efficient (fast), that is, able to 
complete a calculation in a reasonable or feasible 
amount of computing time; consistent, in that it 
should converge to the correct tree (the tree that 
generated the data) as very long sequences become 
available; powerful, it will converge quickly to the 
correct tree with relatively short sequences; robust, 
the method should not be sensitive to small 
deviations from the model; and falsifiable, it must 
be possible to show the data do not fit a particular 
model (Penny et al. 1992). There is nothing absolute 
in these five criteria and others may modify them. 
Statisticians will tend to combine the criteria of 
efficient and powerful under a single concept of 
efficiency. We prefer to separate them because it 
acknowledges the major advances in theoretical 
computing over the last half century. Such studies 
have shown some potentially powerful (in our 
terminology) methods to require millions, or billions, 
of years to calculate and so can scarcely be called 
efficient. Statisticians may prefer the phrase 
“computationally efficient” where we use just 
“efficient”. 

Fig. 2 The Hadamard conjugation is an invertible 
relationship between an evolutionary model and sequence 
data. A model consists of: a tree (A); rates of change 
along each edge (two of which are shown in 2A); and a 
mechanism of sequence change. Given such an 
evolutionary model, the Hadamard conjugation calculates 
the probabilities of all signals (patterns) in the 
sequences (B). Given the probabilities of the patterns in 
sequences (2B) the Hadamard conjugation recovers the 
parameters in the original model. In practice, only 
estimates of the probabilities in 2B are known. 

Although this overview for evaluating tree 
reconstruction methods was reviewed recently 
(Penny et al. 1992), there have been significant 
advances in understanding in the past year. Perhaps 
the main advance is understanding that criteria for 
selecting the “best tree” (optimality criteria) are, by 
themselves, neither consistent nor inconsistent. 
Many optimality criteria will be consistent if the 
sequences are corrected appropriately for multiple 
changes, but will not be consistent otherwise. In the 
past we focused on the optimality criteria them- 
selves, rather than on the adjustments for multiple 
changes. This conclusion is reinforced by the 
observation that methods such as neighbour joining 
(Saitou & Nei 1987) are both efficient and consistent 
(Studier & Keppler 1988); these two standards for 
evolutionary tree methods are compatible. The 
Hadamard conjugations are a useful framework for 
studying tree reconstruction methods in general. 


MATERIALS AND METHODS 


The Hadamard conjugation 


Before describing this process it is helpful to discuss 
the relationship between an evolutionary model and 
data, such as sequence data. In our terminology, an 
evolutionary model consists of three parts: (1) a tree 
(or more generally a connected graph); (2) a 
mechanism for change; and (3) probabilities for 
change along each edge of the tree. 

The simplest mechanism is that all changes are 
independent and identically distributed (i.i.d). From 
such a three-part model the expected properties of 
the sequence data can be calculated (Fig. 2). Methods 
such as current implementations of maximum 
likelihood (Felsenstein 1981) vary the parameters 
of the model to maximise the probability of obtaining 
the observed data (left to right in Fig. 2). In contrast, 
most other methods work in the reverse direction 
(right to left in Fig. 2) by starting with observed 
sequences and estimating parameters of the unknown 
evolutionary model. The Hadamard conjugation is 
an invertible calculation between the model and 
frequencies of sequence patterns, and hence goes 
equally in either direction (Hendy & Penny 1993). 
The Hadamard method is a form of discrete Fourier 
transform (Hendy & Penny 1993) and so it is 
appropriate to refer to it as spectral analysis. 

Figure 3 shows the steps in the calculation, 
starting with an evolutionary model in 3A, with the 
numbers on each edge being the probabilities that 
there is a character state change between the 
endpoints. The same model is expressed in 3B as a 
vector, p, the entries being the probabilities of 3A. 
The γ (gamma) vector in 3C contains the same 
information about edges of the tree but is corrected 
for multiple changes along each edge, with γ5 = γ6 = 
0, and γ0 being minus the sum of the other terms. 
The Hadamard transform adds different γ values to 
give actual path lengths connecting pairs of taxa. 
The results of this transform are in the ρ (rho) vector, 
3D. Some of the entries in ρ correspond to a path 
between a pair of taxa and thus are equivalent to the 
usual genetic distances after adjustment for multiple 
changes; thus distances are a subset of ρ. The 
correction for multiple changes made between p and 
γ is reversed between ρ and the next vector r (3E). 
This contains the observed lengths, without 
corrections for multiple changes, of the same path 
sets as in ρ. Again, r includes as a subset genetic 
distance values but now without correction for 
multiple changes. Finally, the inverse of the 
Hadamard transform is applied to r to give the vector 
s (3F), which has the expected frequencies of all 
patterns that would be observed in sequences, given 
the model in 3A. 
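
The chain of steps from the model to the expected patterns can be sketched as follows. This is a minimal illustration, not the HADTREE implementation. It assumes the two-state symmetric model, the correction gamma_i = -1/2 ln(1 - 2 p_i), and the form s = H^-1 exp(H gamma). Under these assumptions each entry of rho is minus twice a corrected path length, a scaling that may differ from the one drawn in Fig. 3. The edge probabilities are hypothetical values for a four-taxon tree with internal edge 3.

```python
import math


def hadamard(v):
    """Multiply v by the Hadamard matrix H[i][j] = (-1)**popcount(i & j)."""
    size = len(v)
    return [sum(v[j] * (-1) ** bin(i & j).count("1") for j in range(size))
            for i in range(size)]


def model_to_patterns(p):
    """Expected pattern probabilities s from edge probabilities p (p[0] unused)."""
    # Correct each edge for multiple changes (two-state symmetric model).
    gamma = [0.0] + [-0.5 * math.log(1.0 - 2.0 * x) for x in p[1:]]
    gamma[0] = -sum(gamma[1:])        # gamma_0 is minus the sum of the rest
    rho = hadamard(gamma)             # values for path sets, corrected
    r = [math.exp(x) for x in rho]    # reverse the correction
    return [x / len(r) for x in hadamard(r)]  # inverse transform is H / N


# Hypothetical edge probabilities: pendant edges 1, 2, 4, 7 and internal
# edge 3; entries 5 and 6 are not edges of this tree and stay at zero.
p = [0.0, 0.20, 0.02, 0.01, 0.10, 0.0, 0.0, 0.10]
s = model_to_patterns(p)
print([round(1000 * x, 1) for x in s])
```

With these assumed values the output, scaled to 1000 sites, agrees to one decimal place with the "Original" column of Table 2, and entry 3 (the true internal edge) comes out smaller than entries 5 and 6.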

A Hadamard transform is a single application of 
multiplying a vector by the Hadamard matrix. In the 
overall calculation from 3C to 3F there is a 
Hadamard transform, a correction for multiple 
changes, then the inverse of the Hadamard transform. 
This process of applying a function, then a second 
function, followed by inverting the first is called a 
conjugation, so the overall process of going from 
3C to 3F (or the reverse) is a Hadamard conjugation. 
Note that going from 3B (p) to 3E (r), or the inverse, 
is also a conjugation. All calculations are invertible 
so that the calculation could start with s (patterns in 
sequences) and recover the values in γ (3C) and then 
p (3B, probabilities that a change is observed along 
an edge of the tree in the model). 

Figure 3 also illustrates the points where different 
optimality criteria select a tree. Standard parsimony, 
compatibility, and methods using linear invariants 
(Lake 1987; Li et al. 1987; Fu & Li 1992) use the 
observed sequences, which is equivalent to the s 
vector. Distance methods, whether they be algorithmic 
methods without an optimality criterion 
(Penny et al. 1992), or use a measure such as 
maximising the fit between the observed distances 
and distances on the tree, act at either the r or ρ 
level. Without adjusting for multiple changes they 
act on a subset of values from r; when adjustments 
are made for multiple changes they use an equivalent 
subset from ρ. The closest tree criterion (Hendy & 
Penny 1989) is used on γ values. However, by 
analogy with distance methods that can be used with 
either observed or adjusted distances, optimality 
criteria such as closest tree, compatibility, and 
parsimony can all be used on either observed 
partition values (s) or inferred partition values (γ). 
This observation has interesting consequences that 
are discussed later. 

Fig. 3 The intermediate steps of the Hadamard 
conjugation. The steps are shown starting from the model 
but as all calculations are invertible the figure could start 
from either end. A, An evolutionary model shown as a tree. 
B, The same model expressed as a vector p of probabilities 
of observing a change along edges of the tree; the two ways 
of describing the trees (3A and 3B) are equivalent. C, The 
γ (gamma) vector containing the same information as 3B 
but corrected for multiple changes along each edge of the 
tree. D, The ρ vector resulting from applying the Hadamard 
transform to γ to give path lengths for all sets containing 
an even number of taxa. E, The r vector of observed 
lengths of the same path sets in ρ after removing the effects 
of multiple changes along the paths. F, The vector s which 
has the expected frequencies that would be observed in 
sequences, given the model in 3A. All calculations are 
invertible so that the calculation could start with s and 
recover the values in γ (3C) and p (3B). Genetic distances 
between pairs of taxa are shown as subsets of ρ and r. 

Fig. 4 Another view of the steps 3B-3F in Fig. 3 shows the 
relationships between the different steps in terms of whether 
the values in the vector are (1) observable directly (p, r, and 
s) or corrected for multiple changes (γ and ρ), or (2) 
properties of path sets (ρ and r) or frequencies related to 
individual edges of a tree (p, γ, and s). The Hadamard 
transform interconverts between properties of edges and path 
sets, and the correction for multiple changes interconverts 
between observable and total lengths. 
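
A small sketch can make the difference between applying a criterion to s and to γ concrete. It assumes the two-state symmetric model and the inverse relation gamma = H^-1 ln(H s). It also uses a simplification that holds for four taxa under parsimony and compatibility, and is used here as a stand-in for closest tree as well: choosing among the three possible trees amounts to picking the largest of entries 3, 5, and 6. The input is the "Original" column of Table 2; the function names are illustrative.

```python
import math


def hadamard(v):
    """Multiply v by the Hadamard matrix H[i][j] = (-1)**popcount(i & j)."""
    size = len(v)
    return [sum(v[j] * (-1) ** bin(i & j).count("1") for j in range(size))
            for i in range(size)]


def patterns_to_gamma(s):
    """Inverse conjugation gamma = H^-1 ln(H s); needs every entry of H s > 0."""
    r = hadamard(s)
    rho = [math.log(x) for x in r]
    return [x / len(rho) for x in hadamard(rho)]


def best_internal_edge(values):
    """Index (3, 5 or 6) of the largest internal-edge entry for four taxa."""
    return max((3, 5, 6), key=lambda i: values[i])


# "Original" column of Table 2, rescaled from 1000 sites to proportions.
s = [x / 1000 for x in (628.8, 157.5, 16.4, 17.3, 70.9, 19.1, 19.1, 70.9)]
gamma = patterns_to_gamma(s)
print("selected from s:    ", best_internal_edge(s))
print("selected from gamma:", best_internal_edge(gamma))
```

On the observed values the largest entry is 5 (tied with 6), while on the adjusted values it is 3, the edge joining taxa 1 and 2.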

The overall pattern of the calculation is illustrated 
in Fig. 4 for the five vectors p, γ, ρ, r, and s. The 
Hadamard transform interconverts between vectors 
containing properties of edges of a tree (p, γ, and s) 
and vectors containing properties of path sets (ρ and 
r). To the left of the dashed line (4A and 4D) the 
vectors represent edges of a tree, on the right (4B 
and 4C) they represent sets of paths through the tree. 
The adjustment for multiple changes occurs between 
“observable” changes (p, r, and s) and those adjusted 
for multiple changes (γ and ρ). The two diagrams 
above the dotted line (4A and 4B) are the changes 
observed between two endpoints; below the line they 
are adjusted to compensate for multiple changes 
along both individual edges and paths. Although p 
and s occur in the same box in 4A they are quite 
different; p contains information on what happened 
(if the internal nodes could be observed) and s is 
what we observe in present sequences. The 
Hadamard conjugation is adjusting for all unobserved 
multiple changes whether they occur along 
individual edges of the tree or on different edges. 

The sequence data used in this study come from 
the following sources: sequences used to study the 
origin of the green plant chloroplasts (Lockhart et 
al. 1992); ribosomal RNA sequences from New 
Zealand skinks (Hickson et al. 1992); and sequences 
from a hemoglobin pseudogene from five primates 
(Miyamoto et al. 1988). 

The analyses have been done with a variety of 
programs that run on 386 or better PC micro- 
computers. The programs are: 


PREPARE—this program reads in sequences in a 
variety of formats, allows a variety of simple 
analyses of the sequences such as base frequencies, 
editing and selecting sequences to be studied, and 
finally outputs these sequences in formats suitable 
for programs such as PAUP, Phylip, Hadtree, and 
Trees; 


HADTREE—reads in frequencies of patterns in 
sequences (from Prepare), carries out corrections for 
multiple changes using the Hadamard conjugation, 
and searches for the closest tree (Hendy & 
Penny 1989) or an optimal tree evaluated by other 
optimality criteria. It allows models of evolution 
(tree, edge lengths, and a mechanism of change) to 
be entered. From this model the expected data may 
be calculated and saved, random sub-sets of data 
may be selected, and bootstrapping or other random 
samples can be selected (an application of this 
program will be given later). The program handles 
20 taxa with two-state characters or 11 with four- 
state characters; 


TREES—this program takes sequence data and 
allows a variety of analyses from calculating 
incompatibilities, distances, weighting of characters, 
a branch and bound search for minimal length 
(parsimony) trees, and has a variety of methods of 
evaluating or comparing trees by different tree 
comparison metrics; it allows various random- 
isations of data and does tests on trees based on 
these; 


HADVAR—a reduced version of HADTREE that 
has an additional feature of calculating the variance/ 
covariance matrix for up to 11 sequences with two- 
state characters; 
TURBOTREE—this is another branch and bound 
program searching for the optimal tree. This program 
only uses the parsimony criterion but has been 
modified to use closest tree and incompatibility in 
HADTREE. It does have the advantage for par- 
simony of using four-state characters, such as 
nucleotide sequences, directly; 


DELUGE—a new program (Penny & Steel in press) 
that is effective in searching for an optimal solution 
amongst a large number of possible solutions. 


RESULTS 


Application 1 (model to data, and back to the 
model) 


Figure 5A is an evolutionary tree taken from Hendy 
& Penny (1989) and is an example where one lineage 
has evolved slower than the others. The expected 
numbers of changes are shown for the edges of the 
tree (3A) and again as a histogram (5A). Each edge 
of the tree in 3A is given an index so that the values 
can be expressed as entries in a vector, and those 
entries are then shown as the histogram in 5A. The 
Hadamard conjugation (Hendy & Penny 1993) is 
then used to calculate the expected number of 
observed changes in real sequences (Fig. 5B). 

In this example, the third entry, which represents 
the expected number of observed changes on the 
edge 3, joining taxa 1 and 2 to taxa 3 and 4, is smaller 
than the entries for 5 (representing the edge that joins 
taxa 1 and 3 together) and 6 (joining taxa 1 and 4 
together). The example used here is a case where 
parsimony, and indeed other optimality criteria (see 
later) applied directly to the observed patterns, s, in 
the sequences will give the incorrect tree—this is 
known as the Felsenstein paradox (1978). Here we 
see examples of the correct historical signals in 5A 
(in the unobservable model) but, in the observable 
data (5B), some “false” signals introduced by the 
multiple changes on different edges appearing as 
single changes along one edge. 

Figure 5C shows a random sample of 200 sites 
selected from the data in 5B. The inverse of the 
Hadamard conjugation is then applied to this sample 
to give Fig. 5D, which we would expect to be close 
to 5A. The differences between 5A and 5D are 
effects of sampling error, the signals that have arisen 
from sampling error rather than being historical 
signals. 
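
The sampling experiment can be imitated with a short sketch. It is an illustration under stated assumptions, not the HADTREE procedure: sites are drawn with replacement from the "Original" pattern frequencies of Table 2, the inverse relation gamma = H^-1 ln(H s) for the two-state symmetric model is applied to the sample, and the seed and function names are arbitrary.

```python
import math
import random


def hadamard(v):
    """Multiply v by the Hadamard matrix H[i][j] = (-1)**popcount(i & j)."""
    size = len(v)
    return [sum(v[j] * (-1) ** bin(i & j).count("1") for j in range(size))
            for i in range(size)]


def sample_and_invert(s, n_sites, seed=0):
    """Draw n_sites patterns with replacement from s, then estimate gamma."""
    rng = random.Random(seed)
    draws = rng.choices(range(len(s)), weights=s, k=n_sites)
    counts = [draws.count(i) for i in range(len(s))]
    freqs = [c / n_sites for c in counts]
    r = hadamard(freqs)
    if min(r) <= 0:
        raise ValueError("logarithm undefined for this sample")
    gamma = [x / len(r) for x in hadamard([math.log(x) for x in r])]
    return counts, gamma


s = [x / 1000 for x in (628.8, 157.5, 16.4, 17.3, 70.9, 19.1, 19.1, 70.9)]
counts, gamma = sample_and_invert(s, 200)
for index, (count, g) in enumerate(zip(counts, gamma)):
    print(index, count, round(g, 4))
```

Entries of the estimated gamma that are zero in the model (5 and 6 here) will generally be non-zero in a finite sample; those departures are the sampling-error signals.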

Fig. 5 Calculations from an evolutionary model. A, The 
model from 3A expressed as the vector. B, The 
probabilities for the signals, s, in the observed data. 
C, A set of 200 sites randomly selected, with replacement, 
from 5B. D, The gamma (γ) vector calculated from 5C. 

The data in Fig. 3-5 can be used to demonstrate 
the effect on an optimality criterion of using 
observed (s) or adjusted (γ or ρ) values when 
selecting the optimal tree. Figure 6 shows the trees 
selected by parsimony, incompatibility, and closest 
tree on s values (Fig. 6B) and on γ values (Fig. 6A). 
All three select the wrong tree from s and the 
correct tree from γ. But the values in s are 
probabilities expected from infinitely long sequences. 
Therefore, it is not a sampling error problem but a 
difficulty with the optimality criterion being applied 
to the observed sequence values. It is not valid to 
conclude that “parsimony” or “closest tree” are 
consistent or inconsistent per se; the conclusion 
depends on the data being used. This conclusion is 
one of the two important developments referred to 
in the Introduction: the realisation that appropriate 
adjustments for multiple changes are essential for 
most optimality criteria to be consistent. The other 
important development is that the criteria of being 
efficient (fast) and consistent are compatible. 
Some methods can guarantee, given very long 
sequences, to find the tree that generated the data 
(Studier & Keppler 1988; Charleston et al. 1993), 
while still being efficient in the sense that the time 
to complete the calculations increases in a 
polynomial way with the size of the problem. 
Programs in this class follow simple algorithms to 
build a single tree, rather than search for a global 
optimum. The methods include neighbour-joining 
(Saitou & Nei 1987), neighbourliness (Sattath & 
Tversky 1977), and a modification of neighbour- 
liness (Charleston et al. 1993). We have called these 
methods algorithmic (Penny et al. 1992) or 
constructive (Charleston et al. 1993). They use 
genetic distances to build a tree. 
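
As a rough illustration of a distance-based choice for four taxa, the sketch below uses a four-point rule: the pairing of taxa with the smallest summed distance is selected. It is a stand-in for the cited methods, not an implementation of neighbour joining or neighbourliness. The observed distances are hypothetical values for a tree ((1,2),(3,4)) in which taxon 2 sits on a short edge, and the correction d = -1/2 ln(1 - 2 d_obs) assumes the two-state symmetric model.

```python
import math


def correct(d_obs):
    """Two-state symmetric correction for multiple changes."""
    return -0.5 * math.log(1.0 - 2.0 * d_obs)


def best_pairing(dist):
    """Four-point rule: the pairing of four taxa with the smallest summed distance."""
    pairings = [((1, 2), (3, 4)), ((1, 3), (2, 4)), ((1, 4), (2, 3))]
    return min(pairings, key=lambda pq: dist[pq[0]] + dist[pq[1]])


# Hypothetical observed proportions of differing sites between pairs of taxa.
observed = {(1, 2): 0.212, (3, 4): 0.180, (1, 3): 0.2648,
            (2, 4): 0.12368, (1, 4): 0.2648, (2, 3): 0.12368}
corrected = {pair: correct(d) for pair, d in observed.items()}

print("observed distances: ", best_pairing(observed))
print("corrected distances:", best_pairing(corrected))
```

With these numbers the uncorrected distances favour a pairing other than (1,2),(3,4), while the corrected distances recover it, which mirrors the role of the adjustment for multiple changes.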

That methods using genetic distances can be 
consistent raises the question of the relative 
importance of the information loss in converting 
sequences to distances. We have pointed out (Penny 
1982; Steel et al. 1988) that such a conversion loses 
much of the information in the data; it is not possible 
to reconstruct the original patterns in the sequences 
from distances. Spectral analysis helps understand 
this apparent problem. With two-state characters and 
n taxa there are $2^{n-1}$ possible patterns (bipartitions) 
in the s, γ, and q vectors (Table 1), but for distances 
there are only order $n^2$, $O(n^2)$, values, one for each 
pair of taxa. Given only distance values it is 
impossible, in general, to recover the $2^{n-1}$ values 
corresponding to the different edges of all possible 
trees. The answer to the apparent problem is that, at 
least with simulated “data”, we expect, when 
sampling error is eliminated, to have only $O(n)$, that 
is 2n-3, values significantly greater than zero (n pendant 
edges common to all trees and n-3 internal edges 
that vary between trees; see Fig. 5A). Thus, it is only 
necessary when the data fit the model precisely to 
estimate 2n-3 edge length values from n(n-1)/2 
distance values. There is thus no contradiction in 
having a method based on distance values being 
consistent. Thus, a method may be both efficient and 
consistent. The numbers of partitions, distance 
values, and edges of the tree being estimated for up 
to 10 taxa are given in Table 1. 
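
The counts behind Table 1 follow from simple formulas, sketched below. The function name is illustrative. The parsimony column is assumed to be the bipartitions minus the constant pattern and the n single-taxon patterns, that is $2^{n-1} - (n+1)$, which matches the tabulated values.

```python
def table1_row(n):
    """Counts for n taxa with two-state characters."""
    bipartitions = 2 ** (n - 1)
    # Assumed reading of the header: drop the constant pattern and the
    # n single-taxon patterns, which do not discriminate between trees.
    parsimony_patterns = bipartitions - (n + 1)
    distances = n * (n - 1) // 2
    edges = 2 * n - 3
    return bipartitions, parsimony_patterns, distances, edges


for n in range(4, 11):
    print(n, *table1_row(n))
```

The exponential growth of the first two columns against the quadratic and linear growth of the last two is the point of the comparison.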

The current questions are then related to the 
power of algorithmic methods such as neighbour 
joining, their robustness, and their falsifiability. A 
powerful method in our terminology (Penny et al. 
1992) converges quickly to a single tree as the 
sequences get longer. Alternatively, the shorter the 
sequence required to converge to a single tree, the 
more powerful the method. With simulated data we 
expect only 2n-3 signals and, in this case, the rate 
of convergence (the power of the method) will 
depend largely on the values of the variance of the 
estimates after correcting for multiple changes. The 
variance on these 2n-3 signals should be expected 
to be the important determinant on the rate of 
convergence. One approach to estimating how well 
a small number of distances summarise the 
information in the original bipartitions (s values) is 
to convert bipartitions to distances, then use these 
to determine how accurately they allow the 
estimation of the original bipartition values. Results 
for such a calculation are given in Table 2 where 
we compare the original s values with those 
estimated from distances. 

Fig. 6 Different trees selected by optimality criteria on 
observed and adjusted lengths. With observed values (s), 
closest tree, compatibility, and parsimony all select the 
wrong tree (B) with the data from Fig. 5. But after 
adjustment for multiple changes (γ or ρ) each selects the 
correct tree (A). All three criteria are thus “inconsistent” 
on observed values but “consistent” with adjusted values. 
Consistency is therefore not a property of the optimality 
criterion itself, but depends also on the data to which the 
criteria are applied. 

However, with real data, there are usually more 
than 2n-3 signals in the data, and it is not clear how 
well algorithmic methods will work with real data 
where they are only using genetic distances. For 
example, if there are more signals than the number 
of genetic distances, n(n-1)/2, not all of which are 
independent, then the estimates could be quite poor. 
The number of patterns in real data is the subject of 
the next section. 





Table 1 Increase in values as the number of taxa 
increases. 

No. of      No. of         Parsimony            No. of      Edges in 
sequences   bipartitions   patterns             distances   binary tree 
n           $2^{n-1}$      $2^{n-1} - (n+1)$    n(n-1)/2    2n-3 
4           8              3                    6           5 
5           16             10                   10          7 
6           32             25                   15          9 
7           64             56                   21          11 
8           128            119                  28          13 
9           256            246                  36          15 
10          512            501                  45          17 

Application 2 (Sequence data to evolutionary 
model) 


With some datasets we have found very good 
agreement when we use the test (Penny et al. 1987; 
Lockhart et al. 1992) of comparing observed and 
predicted sequences. One of the best examples of a 
good fit is a dataset of five primate sequences of a 
hemoglobin pseudogene (Miyamoto et al. 1988). 
Pseudogenes are duplicate copies of genes where a 
mutation prevents an effective protein from being 
expressed. Consequently it is not expected there will 
be much selection, positive or negative, on these 
sequences, and so they may well fit simple models 
of evolutionary change. In this case the mechanism 
that all changes are independent and identically 
distributed (i.i.d) is a good first approximation 
(Hendy & Penny unpubl. data). Our expectation is 
that studies based on “simulated data” would be 
reasonable models for datasets, such as this one, that 
show a good fit to the model. 

With most real data the i.i.d model is too simple 
and the signals will be more complex. An example 
is a dataset with 10 New Zealand Leiolopisma skinks 
(see Hickson et al. 1992) where there are many more 
apparent signals in the data (Fig. 7). Figure 7A shows 
all 44 signals above a background cutoff, including 
10 patterns supporting pendant edges (the outer 
edges of the tree that join to each of the 10 skinks). 
A binary tree will have n-3 internal edges, so with 
simulated data for 10 taxa we expect about 17 signals 
(10 for pendant edges and seven for internal) rather 
than the 44 found here. 

Table 2 Frequencies of s bipartitions estimated from 
distances. The expected values for the bipartitions s from 
the model in Fig. 5 (“original”) followed by the estimates 
of s from distance values in ρ, given distance values 
rounded to six decimal places (“6 dec”), and to two 
decimal places (“2 dec”). 

Number   Original   6 dec   2 dec 
0        628.8      627.0   627.0 
1        157.5      159.3   162.0 
2         16.4       18.2    17.9 
3         17.3       15.5    15.2 
4         70.9       72.7    71.7 
5         19.1       17.3    17.3 
6         19.1       17.3    17.3 
7         70.9       72.7    71.7 

We normally omit displaying signals for pendant 
edges because they occur in all trees. Instead we 
concentrate on the (in this case 34) γ values that 
correspond to possible internal edges of the tree (Fig. 
7B). Table 3 gives the index for these 34 signals, 
the strength of the signal as the expected number of 
sites that support them (adjusted for multiple 
changes), and the taxa in the signal. The closest tree 
(Hendy & Penny 1989) optimality criterion selects 
n-3 edges, and in Fig. 7C these seven are shown as 
positive values. Conversely, the 27 signals in γ not 
in the tree are shown as negative numbers. This 
example is interesting in that neither of the two 
largest signals is included in the optimal tree. A 
method that works by a local optimisation technique, 
rather than by global optimisation, would tend to 
select one of these largest two signals. Showing both 
sets of signals, those in the optimal tree and those 
omitted, gives an immediate indication of signals 
contradicting the optimal (closest) tree. We find this 
particularly helpful in detecting support for other 
possible trees and a useful adjunct to bootstrapping 
and/or jackknifing. 

Another step is to analyse the signals omitted 
from the optimal tree. One approach (Table 4) 
simply counts the number of times each taxon occurs 
in an excluded partition (“exc” in Table 4). This 
value can be augmented by summing the γ values 
of the excluded partitions. In practice, each excluded 
γ value is divided pro rata between the taxa in the 
partitions, omitting cases where exactly half the taxa 
are excluded. In the present example, taxa 1 and 3 
(“Stewart Island Green” and L. grande) show few 
conflicts whereas taxa 2, 7 and 8 (“L1218”, L. 
zelandicum, and L. maccanni) show many conflicts. 
This example is chosen to represent a difficult case 
where there are many conflicting signals. We do find 
intermediate cases where it appears as if only one 
sequence is causing most of the difficulty; for 
example in the !Kung subset of the human mito- 
chondrial D-loop dataset (Vigilant et al. 1991). 
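
The tally described above might be computed as in the following sketch. The function names and the input values are hypothetical. One assumption is that partitions containing exactly half the taxa are left out of both the count and the sum; another is that the index identifies the smaller side of the partition, as in Table 3.

```python
def taxa_in(index):
    """Decode a partition index into taxon numbers (taxon k adds 2**(k-1))."""
    return [k + 1 for k in range(index.bit_length()) if (index >> k) & 1]

def conflict_tally(excluded, n):
    """excluded maps partition index -> signal value for partitions that are
    not in the optimal tree. Returns {taxon: (exc, sum)}."""
    exc = {t: 0 for t in range(1, n + 1)}
    total = {t: 0.0 for t in range(1, n + 1)}
    for index, value in excluded.items():
        members = taxa_in(index)
        if 2 * len(members) == n:
            continue  # assumption: half-and-half partitions are skipped entirely
        for t in members:
            exc[t] += 1
            total[t] += value / len(members)  # equal share for each taxon
    return {t: (exc[t], total[t]) for t in exc}

# Illustrative values only, not taken from Table 3.
tally = conflict_tally({258: 6.0, 130: 5.0, 31: 2.0}, n=10)
print(taxa_in(258))  # [2, 9]
print(tally[2])      # (2, 5.5)
```

Here partition #31 (taxa 1 to 5) is skipped because it splits the 10 taxa exactly in half, so taxon 2 is charged only for its shares of #258 and #130.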


Falsifiability—allowing data to reject the 
model 


One systematic error that causes most methods of 
reconstructing trees to give the wrong tree arises when 
sequences differ in their base composition (GC 
content). We reported (Penny et al. 1990) that 
methods that are consistent when the assumptions 
of the model are met, including a stable GC content, 
may become inconsistent when the GC content 
varies between sequences. Lockhart et al. (1992) 
calculated the general conditions under which the 
closest tree would fail to be consistent. In practice, 
we will not know whether or not we are in the region 
of consistency. 

Fig. 7 Largest signals in the y vector for 10 skinks. A, Includes all the largest values, including those relating to 
individual taxa (pendant edges) which are in all trees. B, The same as (A) but excluding pendant edges. C, The same 
as (B) with the edges in the optimal tree above the zero line and those contradicting the optimal tree shown as negative 
numbers. 

Two approaches have been used to estimate the 
effect of any deviations from equal GC content. 
One is to compare the observed and predicted values 
from the optimal tree (Penny et al. 1987; Lockhart 
et al. 1992). We say that for a model to be scientific, 
the data must be able to reject the model (Penny et 
al. 1992). Another approach is a four-taxon GC test 
(Lockhart & Penny 1992) where the effect of 
unequal GC contents is subtracted from the observed 
parsimony signals. This is illustrated in Fig. 8 for 
four atpA sequences where two of them (Zea mays 
mitochondrial and chloroplast sequences) are coded 


Table 3 Signals from 10 lizards, after adjustment for multiple changes. Partitions (signals) are ranked 
according to their expected frequency (“value”) after a standard adjustment for multiple changes. Pendant 
edges (partitions with only a single taxon) are omitted from this table as they occur in all possible trees. 
The index for each partition, “#”, is the sum of binary numbers (1, 2, 4, 8, 16, ...) for the taxa in the partition, 
or its complement, whichever is the smaller number. “258” therefore indicates a partition including taxa 2 
and 9 (2 + 256); the partition with the next highest frequency, 5.244, is #130, which contains taxa 2 and 8 
(2 + 128). The optimal tree is selected from these partitions (see Fig. 6). The lizards are all New Zealand 
skinks and, except for one Cyclodina species (number 5), are Leiolopisma species. They are: (1) an 
undescribed skink “Stewart Island Green”; (2) another undescribed species “L1218”; (3) L. grande; (4) L. 
microlepis; (5) Cyclodina aenea; (6) L. inconspicuum; (7) L. zelandicum; (8) L. maccanni; (9) L. 
nigriplantare polychroma (from Twizel); (10) L. otagense (see Daugherty et al. 1990 and Hickson et al. 
Fig. 8 The four-taxon GC test and chloroplast origins. The maize chloroplast and mitochondrial 
sequences are both AT rich and methods tend to place them together on the tree. 
Figure content: 1+2 codon posn. atpA sequences for 4 taxa, with GC content (%) in parentheses: 
maize cp. (42), Synechococcus sp. (52), maize mt. (43), E. coli (63). 
Observed 2 colour parsimony patterns: (acca,caac) 28; (acac,caca) 19; (acca,caac) 46. 
Expected 2 colour parsimony patterns (values legible): 34, 29. 


Table 4 Numbers of times each taxon is in an excluded partition. For each 
taxon “exc” is the number of times it is in a partition not selected (excluded) 
from the optimal tree in Fig. 6. Each value (Table 3) for a partition excluded 
from the optimal tree is divided equally among the taxa and summed over all 
partitions. This helps identify particular taxa that are not fitting the model. In 
this example, taxon 2 has the highest sum and taxon 7 clashes most frequently 
with the optimal tree. This means that taxon 7 (L. zelandicum) clashes with 
many of the small signals but taxon 2 (“L1218”) clashes with many of the major 
signals in the dataset. 

Taxon  exc  Sum      Taxon  exc  Sum      Taxon  exc  Sum 
1      3    1.883    2      9    9.657    3      2    0.657 
4      5    2.801    5      7    4.972    6      8    4.578 
7      12   7.184    8      11   9.364    9      5    5.485 
10     9    5.162 
for in organelles (mitochondria and chloroplasts) that 
have genomes with a low GC content. 

The application of the four-taxon GC test is 
illustrated with first and second codon positions from 
the α-subunit of ATP synthetase (atpA) of 
Synechococcus sp., E. coli, and chloroplast and 
mitochondrial sequences from Zea mays. From a 
variety of evidence the correct tree would unite the 
Synechococcus and chloroplast sequences. In this 
case, most patterns do support this relationship (46 
in Fig. 8), but a sizeable number support uniting the 
chloroplast and mitochondrial sequences. The Zea 
mays (maize) chloroplast and mitochondrial 
sequences are both AT rich (42 and 43% GC) at the 
variable sites of the first and second codon positions. 
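
The GC percentages quoted here are taken at the variable sites of the first and second codon positions. A minimal sketch of such a calculation follows. It assumes an ungapped alignment that starts in frame at codon position 1; the function name and the toy sequences are hypothetical and are not the atpA data.

```python
def gc_at_variable_sites(alignment, codon_positions=(1, 2)):
    """alignment: {name: sequence}, all sequences of equal length and in frame.
    Returns the GC percentage of each sequence at the variable columns of the
    chosen codon positions."""
    names = list(alignment)
    seqs = {n: alignment[n].upper() for n in names}
    length = len(seqs[names[0]])
    # Columns at the requested codon positions (1-based within each codon).
    cols = [i for i in range(length) if (i % 3) + 1 in codon_positions]
    # Keep only columns where the sequences are not all identical.
    variable = [i for i in cols if len({seqs[n][i] for n in names}) > 1]
    result = {}
    for n in names:
        bases = [seqs[n][i] for i in variable]
        gc = sum(b in "GC" for b in bases)
        result[n] = 100.0 * gc / len(bases) if bases else float("nan")
    return result

# Toy alignment (three codons); the last codon is invariant at positions 1 and 2.
toy = {"a": "ATGAAATTG", "b": "GCGATATTC", "c": "GCGGCATTA"}
print(gc_at_variable_sites(toy))  # {'a': 0.0, 'b': 50.0, 'c': 100.0}
```

Restricting the calculation to variable sites matters because constant sites dilute the compositional differences between sequences.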

In this example most methods resolve the tree 
correctly: the cyanobacterium (Synechococcus) and 
the chloroplast sequences are united. Nevertheless 
there is a significant signal joining the two organelle 
sequences. However, subtracting the effect of these 
two sequences having similar, but low, GC contents 
shows that there is no strong historical signal for 
uniting these two. The problem cannot yet be 
handled in a general way. The usefulness of the 
four-taxon GC test is two-fold. It sometimes identifies 
cases where the signal from unequal GC content is 
so large that there is no historical signal left 
(Lockhart & Penny 1992), but in other cases, such 
as the one reported here, it gives more confidence 
that the historical signal is correct. The approach can 
therefore pick up cases where unequal GC content 
may prevent the correct tree from being found, and 
it warns the researcher when current methods are 
outside their limit of application. 

The example illustrates one important use of the 
test, giving confidence that there is a sizeable 
historical signal. In other cases (Lockhart et al. 1992), 
subtracting out the misleading GC patterns left no 
discernible historical signal. In 
these cases, no inference as to the correct phylogeny 
was possible. Thus, the test can work two ways, 
giving confidence in some cases, warnings in others. 


DISCUSSION 


A major conclusion of this work is that adjustment 
for multiple changes is important for a method to be 
consistent. Very few methods of tree inference are 
consistent without (nonlinear) corrections for 
multiple changes; the known exceptions are linear 
invariant methods, for example, Lake (1987) and Fu 
& Li (1992). The optimality criteria closest tree, 
compatibility, and parsimony are not consistent on 
observed data but are consistent after appropriate 
adjustment for multiple changes. 

The recognition that methods may be both 
efficient and consistent is also useful. The relative 
importance of information loss in converting from 
sequences to distances is less clear for real data. 
Although retaining all information is not essential 
for a method to be consistent, it may well affect the 
power of the method with real data when there are 
more than n − 3 signals for internal edges of trees. It 
may also affect robustness and falsifiability. More 
complex simulations, in which patterns other than 
the historical signals are added, may be helpful, 
though the problem may be in obtaining results 
of wide generality. Nevertheless, 
it will be particularly interesting to determine 
how much effect small deviations from the assumed 
mechanism may have. 